So one of the things that I would like to briefly get into is what we call passive learning.
So we make things simple: we go for a fully observable environment with a partially observable reward function, and we want to find optimal policies.
It's basically just an MDP with a wrinkle, and what we have to do, and that's sufficient, is the same as in the policy evaluation subtask of policy iteration for MDPs.
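As a quick sketch of what that policy evaluation subtask computes, assuming the MDP is given as plain dictionaries; the names and data layout here are illustrative, not the lecture's notation, and T[s][a] is taken to be a list of (next_state, probability) pairs:

```python
def evaluate_policy(states, R, T, pi, gamma=1.0, iterations=100):
    """Iterate U(s) <- R(s) + gamma * sum_{s'} P(s' | s, pi(s)) * U(s')."""
    U = {s: 0.0 for s in states}
    for _ in range(iterations):
        U_new = {}
        for s in states:
            if s not in pi:        # terminal state: its utility is just its reward
                U_new[s] = R[s]
            else:
                U_new[s] = R[s] + gamma * sum(p * U[t] for t, p in T[s][pi[s]])
        U = U_new
    return U
```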
We want to find out what is the best policy here.
We can do that with the example we also had in policy iteration.
Say we have this kind of environment where we had experimented with the rewards; you remember that we had policies here given by the arrows, namely what you should do when you get into that field, and you have these two plus-one and minus-one rewards.
In one of these examples we had a minus 0.04 disincentive for loitering around.
That gives rise to this optimal policy: you don't quite want to run away here, and you prefer the plus-one exit.
This policy, remember, if you have the full reward function, actually gives rise to the utilities of being in a certain place.
How do you find them out?
Well, you just imagine yourself in a certain state, and then you run according to the policy.
In this case, if you're in (1,1), you would run up and then over and land in the plus-one exit, and if you add up the rewards, that is plus one minus a couple of times 0.04, then you end up with these values.
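For illustration, here is a small sketch of that "imagine yourself in a state and follow the policy" computation for the grid world. The coordinates, the policy arrows, and the -0.04 step reward follow the standard textbook example; the walk is deterministic here, so it ignores the action noise that the actual utilities account for, and the names (POLICY, rollout_utility) are just for this sketch.

```python
# Coordinates are (column, row); only the policy entries on the path from (1,1)
# to the +1 exit at (4,3) are listed.  Illustrative, deterministic walk-through.
POLICY = {(1, 1): "U", (1, 2): "U", (1, 3): "R", (2, 3): "R", (3, 3): "R"}
MOVE = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
EXITS = {(4, 3): +1.0, (4, 2): -1.0}
STEP_REWARD = -0.04

def rollout_utility(state):
    """Follow the policy from `state`, summing rewards until an exit is reached."""
    total = 0.0
    while state not in EXITS:
        total += STEP_REWARD                   # the small per-step penalty
        dx, dy = MOVE[POLICY[state]]
        state = (state[0] + dx, state[1] + dy)
    return total + EXITS[state]                # add the exit reward

print(rollout_utility((1, 1)))   # 1 - 5 * 0.04 = 0.8
```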
If we only have a partially observable reward function, say one where we can't really read off these little reward values, then you can still do something.
What you would do is you would run what we call trials.
As an agent you can actually put yourself into the situation, play a game of chess, live a life, study AI or whatever, and then at some point you're going to get the reward; in this case, say, when you hit an exit.
What you would do is make trials, where you start in state (1,1) and go until you get into a reinforcement state, one where you actually get a reinforcement, or for that matter just go until the end.
Then you sense the rewards, and then you reason backwards from that.
That's the idea.
You have a couple of trials; I've written down a couple of those, a couple of paths through this policy.
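A trial generator for this setup could be sketched as follows. The 0.8/0.1/0.1 action noise, the bouncing off walls, and the full policy table are the usual textbook assumptions for this grid world, not something stated above, and the function names are illustrative.

```python
import random

# The fixed policy for the 4x3 grid world (obstacle at (2,2), exits at (4,3)
# and (4,2)); 80% chance of moving as intended, 10% for each perpendicular
# direction, and bumping into a wall leaves the agent where it is.
POLICY = {(1, 1): "U", (1, 2): "U", (1, 3): "R", (2, 3): "R", (3, 3): "R",
          (2, 1): "L", (3, 1): "L", (4, 1): "L", (3, 2): "U"}
MOVE = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
PERP = {"U": "LR", "D": "LR", "L": "UD", "R": "UD"}
EXITS = {(4, 3): +1.0, (4, 2): -1.0}
OBSTACLE, STEP_REWARD = (2, 2), -0.04

def step(state, action):
    """Sample a successor state: 80% intended direction, 10% each to the sides."""
    a = random.choices([action] + list(PERP[action]), weights=[0.8, 0.1, 0.1])[0]
    dx, dy = MOVE[a]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt == OBSTACLE or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return state                           # bumped into a wall: stay put
    return nxt

def run_trial(start=(1, 1)):
    """One trial: the sequence of (state, reward) pairs until an exit is hit."""
    state, trial = start, []
    while state not in EXITS:
        trial.append((state, STEP_REWARD))
        state = step(state, POLICY[state])
    trial.append((state, EXITS[state]))
    return trial
```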
Then you can just define what the utility is: at some point you see the reward, and that gives you a utility, which you would think of as a sample of the utility function.
Just like we had sampling in Monte Carlo search, where you basically ran all the way down, which means you simulate a game, see whether you win or lose, and then use that as a sample for the utilities upstairs.
You can define a utility which you can sample and that you can use for learning.
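A sketch of that sampling idea, reusing the run_trial generator sketched above: each visit to a state contributes the reward observed from there to the end of the trial as one sample, and the samples are simply averaged. The function name and the number of trials are just illustrative choices.

```python
from collections import defaultdict

def direct_utility_estimate(num_trials=1000):
    """Average the observed reward-to-go per state over many trials."""
    samples = defaultdict(list)
    for _ in range(num_trials):
        trial = run_trial()                       # from the previous sketch
        rewards = [r for _, r in trial]
        for i, (state, _) in enumerate(trial):
            samples[state].append(sum(rewards[i:]))  # reward-to-go from this visit
    return {s: sum(v) / len(v) for s, v in samples.items()}

U = direct_utility_estimate()
print(round(U[(1, 1)], 3))   # approaches the true utility of (1,1) as trials grow
```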
What we've essentially done is define a delayed reward which we can sample, and if we have enough samples, that gives us enough utilities so that we can essentially do value iteration and use MDP techniques to do unsupervised learning, or reinforcement learning.
That's something we've done quite a lot: you basically take another technique, lift it up one level, and use the old algorithm on some kind of induced learning space.
That's something you can do and that's called direct utility estimation and the algorithms